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Abstract: It has been widely realized that Monte Carlo methods (approxi- 
mation via a sample ensemble) may fail in large scale systems. This work offers 
some theoretical insight into this phenomenon in the context of the particle 
filter. We demonstrate that the maximum of the weights associated with the 
sample ensemble converges to one as both the sample size and the system di- 
mension tends to infinity. Specifically, under fairly weak assumptions, if the 
ensemble size grows sub-exponentially in the cube root of the system dimen- 
sion, the convergence holds for a single update step in state-space models with 
independent and identically distributed kernels. Further, in an important spe- 
cial case, more refined arguments show (and our simulations suggest) that the 
convergence to unity occurs unless the ensemble grows super-exponentially in 
the system dimension. The weight singularity is also established in models with 
more general multivariate likelihoods, e.g. Gaussian and Cauchy. Although pre- 
sented in the context of atmospheric data assimilation for numerical weather 
prediction, our results are generally valid for high-dimensional particle filters. 



1. Introduction 

With ever increasing computing power and data storage capabilities, very large 
scale scientific analyses are feasible and necessary (e.g. [8]). One important appli- 
cation area of high-dimensional data analysis is the atmospheric sciences, where 
solutions to the general (inverse) problem of combining data and model quanti- 
ties are commonly required. For instance, to produce real-time weather forecasts 
(including hurricane and severe weather warnings), satellite radiance observations 
of humidity and radar backscatter of sea surface winds must be combined with 
previous numerical forecasts from atmospheric and oceanic models. To such ends, 
recent work on numerical weather prediction is cast in probabilistic or Bayesian 
terms [10, 21, 26], and much focus in the literature on the assimilation of data and 
numerical models pertains to the sampling of high-dimensional probability density 
functions (pdf) [1, 18, 25, 27]. Motivated by these sampling techniques, we inves- 
tigate the dangers of naively using Monte Carlo approximations to estimate large 
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scale systems. Specifically, in the context of the particle filter, we show that accurate 
estimation of (truly) high-dimensional pdfs require ensemble sizes that essentially 
grow exponentially with the system dimension. 

Some recent work in numerical weather prediction has extended Kalman filter 
solutions to work efficiently in Gaussian systems with degrees of freedom exceeding 
10 6 . One popular extension is given by the ensemble Kalman filter, a Monte Carlo 
based filter version which draws samples from the posterior distribution of the at- 
mospheric state given the data and the model [6, 11]. However, the task of sampling 
in real-time from such high-dimensional systems is conceptually non-trivial: com- 
putational resources limit sample sizes to several orders of magnitude smaller than 
the system dimension. To address sampling errors associated with small ensembles, 
various approaches leverage sparsity constraints to attenuate spurious correlations 
[15, 17, 25]. Moreover, in the Gaussian case, for systems with a finite number of 
dominant modes, moderate sample sizes are sufficient to accurately estimate pos- 
terior means and covariances [12]. 

For longer forecast lead times, the involved dynamical models exhibit strongly 
non-linear behavior and produce distinctly non-Gaussian forecast distributions (e.g. 
Figure 2, [■'>]). In these situations, optimal filtering requires the use of more fully 
Bayesian filtering methods to combine data and models. In the context of oceano- 
graphic data assimilation, one such approach is considered by [27], who proposes a 
sequential importance sampling algorithm to obtain posterior estimates of oceanic 
flow structures. This method falls within the set of procedures typically referred to 
as particle filters (e.g. [!-)]). Based on a finite set of sample points with associated 
sample- weights, the particle filter seeks to propagate the probability distribution of 
the unknown state forward in time using the system dynamics. Once new data is 
available, Bayes theorem is used to re-normalize the weights based on how "close" 
the associated sample points are to the data. 

Although successfully applied to a variety of settings, particle filters often yield 
highly varying importance weights and are known to be unstable even in low- 
order models. Remedies to stabilize the filter include re-sampling (re-normalizing) 
the involved empirical measure at regular time intervals [14, 19], marginalizing 
or restricting the sample space [20, 22], and diversifying the sample (e.g. [13]). 
However, these approaches serve to improve filter performance in low-dimensional 
systems, but do not fundamentally address slow convergence rates when the particle 
filter is applied in large scale systems. In particular, as noted by e.g. [27] and 
[I], when applied to geophysical models of high dimension, sequential importance 
sampling collapses to a point mass after a few (or even one!) observation cycles. 
To shed light on the effects of dimensionality on filter stability, our work provides 
necessary sample size requirements to avoid weight degeneracies in truly large scale 
problems. 

This work is outlined as follows. The next section formulates the problem of using 
ensemble methods for approximation purposes in large scale systems, and provides 
motivating simulations illustrating the potential difficulties of high-dimensional es- 
timation. Our main developments are then presented in Section 3 where we give 
general conditions under which the maximum of the sample weights in the (likeli- 
hood based) particle filter converges to one if the ensemble size is sub-exponential in 
the cube root of the system dimension. The convergence is established in a Gaussian 
context, but is also argued for observation models with independent and identically 
distributed (iid) kernels. The validity of the weight collapse in the case when the 
ensemble grows sub-exponentially in the system dimension (as suggested by the 
simulations) is discussed as an extension to the multivariate Cauchy kernel and 
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its apparent slower collapse. In Section 4, for completeness, in a Gaussian context, 
we show that the particle filter behaves as desired if the ensemble size is super- 
exponential in the system dimension. A brief discussion in Section 5 concludes our 
work. 



2. Model setting and motivation 

This section gives the model setting and describes the Monte Carlo estimation prob- 
lem under consideration. To illustrate the difficulty of high-dimensional estimation, 
we also provide motivating examples that describe the weight singularity. 



2.1. Model setting 

The statistical context in which we motivate our work is as follows. Consider a set 
of n sample points X = {X\, . . . , X n }, where X% £ and both the sample size 
n and system dimension d are "large". We assume that the sample X is drawn 
randomly from the prior (or proposal) distribution p(X). New data Y is related 
to the state X by the conditional density p(Y\X). For concreteness, a functional 
relationship Y = f(X) + e is assumed, and e is taken to be independent the state 
X. The goal is to estimate posterior expectations using the importance ratio: i.e., 
for some function /i(-), we want to estimate 

as an estimator. Based on this formulation, the weights attached to each ensemble 
member 

(1) P ^ 



and use 



1 EU^m 

are the primary objects of our study. As mentioned, in large scale analyses, the 
weights in (1) are highly variable and often produce estimates E(-) which are col- 
lapsed onto a point mass with max(wi) ss 1. For high-dimensional systems, this 
degeneracy is pervasive and appears to hold for a wide variety of prior and likeli- 
hood distributions. 

Next we illustrate the degeneracy of the sample weights as the dimension of X 
and Y grows large. 



2.2. Motivating examples 

To illustrate the weight collapse of the particle filter we simulate weights from (I) for 
both a Gaussian and a Cauchy distributional setting. These densities were chosen 
to parallel the work of [27], which attempts to address particle filter collapse by 
modeling the observation noise using a heavy-tailed distribution. 

In our simulations, the d x 1 observation vector Y is related to the unobserved 
state variable X through the model Y = HX + e. Here, the observation operator 
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H is set equal to the d x d identity matrix, denoted Id, and the proposal (i.e. 
forecast) distribution is taken as zero-mean Gaussian with covariance cov(X) = Id, 
denoted X ~ N(0, Id) - These choices for H and cov(X) allow us to straightforwardly 
manipulate the system dimension, and to study the behavior of the maximum 
weight for various choices of d and n. For the Gaussian case we let e ~ N(0,Id), 
while for the Cauchy case we investigate two scenarios. First, the components of 
e = [ei, . . . , ed] T are taken as iid Cauchy, where each component has location and 
scale parameters set equal to zero and one, respectively Second, e is taken as 
multivariate Cauchy, again with location and scale parameters equal to zero and 
one. 

In each simulation run of the maximum weight, an observation Y is drawn from 
p(Y) and a random sample of n particles X\, . . . ,X n , is obtained from p(X) = 
N(0, Id)- Then, given Y, the sample particles are re-weighted according to (1) and 
the maximum weight, denoted u>( ra ), is determined. To evaluate the dependence of 
w/ n ) on d and n, we vary the system dimension at three levels, and let the Monte 
Carlo sample size increase polynomially in d at a rate of 2.5, i.e. we set n — d . 
For the Gaussian setting, we let d = 10, 50, 100 and obtain n = 316, 17676, 100000, 
while, for the Cauchy settings, where convergence to unity was observed to be 
slower, the maximum dimension is increased to d = 400. Thus, for the Cauchy 
simulations, we set d = 10,50,400, and get n = 316,17676,3200000. Our setup 
results in nine simulation sets each for the three distributional settings, and each 
set is based on 400 independent draws of W(n) ■ 

Histograms of the maximum weight W( n ) f° r the Gaussian setting are displayed 
in Figure 1. The effect of changing the dimension is represented column- wise, and 
the effect of changing the Monte Carlo sample size is given row-wise. Each plot also 
depicts the corresponding sample mean maximum weight E(w^) as a vertical line. 



IID Gaussian 




0.5 1 
d=100 



Fig 1 . Sampling distribution of v)( n ) f or the Gaussian case. The nine plots show histograms of 
f(n) with system dimension varied column-wise (d = 10, 50, 100j and sample size varied row-wise 
(n = 10 2,5 ,50 2 ' 5 , lOO 2,5 ^. The vertical line in each plot depicts E{w^). Each histogram is based 
on 400 simulations. 
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Fig 2. Sampling distribution of for e iid Cauchy (top panel), and e multivariate Cauchy 

(bottom panel). In each panel, the system dimension d is varied column-wise (d = 10, 50, 400j, 
and the sample size is varied row-wise (n = 10 2 5 ,50 2 5 ,400 2 - 5 j. The vertical line in each plot 
depicts E(wr n -\). Each histogram is based on 400 simulations. 



As indicated, for a fixed sample size (i.e. within a row), the distribution of wr n \ is 
dramatically shifted to the right, and we see that wr n -\ approaches unity as d grows. 
Moreover, the same singularity is evident along the diagonal from the lower left 
(d = 10, n = 316) to the upper right (d = 100, n — 100000) histogram. Hence, even 
as n grows at a polynomial rate in d, w< n -\ still approaches unity. 

The histograms in the two panels of Figure 2 show the simulation results for the 
Cauchy settings. As in the previous figure, the effect of changing the dimension is 
given column-wise, and the effect of varying the ensemble size is given row- wise. Also 
depicted is the sample mean maximum weight E(w( n )) as a vertical line. For the 
iid Cauchy case, displayed in the top panel, for fixed n, the sampling distribution of 
W( n ) is clearly shifted to the right as d increases. Moreover, similarly to the Gaussian 
results, the weight singularity is again evident in the histograms along the diagonal 
where the sample size grows as n = d 2 5 . As with the iid setting, the histograms 
of iu( n ) for the multivariate Cauchy case (bottom panel) also demonstrate weight 
collapse. However, as can be seen, the convergence is slower. The reasons for the 
apparent slower collapse will be discussed in Section 3.3. 
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To shed light on the illustrated weight collapse, we next present a formal study 
of the behavior of the maximum importance weight. Specifically, the next section 
develops conditions under which the maximum weight wr n ) converges to unity. 

3. The singularity of the maximum weight for models with iid kernels 

In this section, we make precise the reasons for the previously described weight 
collapse. The primary focus is on situations where the likelihood p(Y\X) is based 
on iid components (or iid blocks) and the proposal distribution p{X) is Gaussian. 
Our basic tool, given below in Lemma 3.1, gives reasonable conditions under which 
the maximum weight based on the general form in (1) converges to unity as d and 
n grow to infinity. 

Our main insight is that, for large d, p(Y\Xi) is often well approximated by the 
form 

(2) p(Y\Xi) ~ exp{-(pd + aVdZi)}(l + o p (l)), 

where Zi follows the standard normal distribution and where \x and a are positive 
constants. The justification of this form is highlighted by the developments in Sec- 
tions 3.1 - 3.3, and the conditions under which (2) produces weight collapse are 
made precise below in Lemma 3.1. Note that, with Z^ representing the i:th order 
statistic from an ensemble of size n, the maximum weight w/ n \ behaves as 

1 

W( n \ ~ = . 

1 ' l + Y^_ 2e -^Vd(z w -z w ) 

Thus, to establish weight collapse, it suffices to show that the denominator in the 
above expression converges to unity for large d. The following lemma gives the 
strong version of (2), which is needed for our conclusion, and formalizes the con- 
vergence. 

Lemma 3.1. Let Si, i = 1, . . . , n, be independent random variables with cumulative 
distribution function (cdf) Gd(-) satisfying 

G d (s) = (l + o(l))<S>(s) 

over a suitable interval, where $(•) is the cdf of a standard normal distribution. 
Specifically, we assume there exist two sequences a n d and a 1 ^ d , where both a l n d 
and ca^d tend to as n and d go to infinity, such that, for s 6 [—d ri ,d ri ] with 
< rj < 1/6, we have 

(3) (1 + < G d (s) < (1 + < d )$(s). 

Let S(y\ < • • ■ < Si n ) be the ordered sequence of S n , . . . , S n , and define, for some 
<J>0, T n , d = YJU e-°^ s ^~ s ^\ Then, if ^ 0, 

E(T n , d \S (1) ) = o p (l). 

A proof of the lemma is provided in the Appendix. 

An immediate implication of this result is weight collapse, i.e. if — > 0, we 
have wr n ) L 

With the normality condition (3) valid for r) < |, our conclusion would hold, 
effectively, whenever — > 0; rather than, as shown above, for ^ 0. Unfor- 
tunately, unless Gd(s) = $(s), our proof is valid only for rj < However, using 
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more refined arguments in conjunction with Lemma 2.5 of [23] one can show that, 
under the conditions of Lemma A.l (see Appendix), only ^f/r — ^ is required 
for collapse. Furthermore, using specialized arguments, we can show that in the 
Gaussian prior-Gaussian likelihood setting, — > is the necessary condition for 
collapse. We do not describe these arguments in detail, but focus instead on Lemma 
3.1 which can be applied more broadly. 

We next turn our attention to high-dimensional, linear Gaussian systems. 



3.1. Gaussian case 



We assume a data model given by Y — HX + e, where Y is a d x 1 vector, H is a 
known d x q matrix, and X is a q x 1 vector. Both the proposal distribution and 
the error distribution are Gaussian with p(X) = N(fix, ^x) and p(e) = X(0, E e ), 
and the noise e is taken independent of the state X . Since the data model can be 
pre-rotated by E s , we set E e = without loss of generality (wlog). Moreover, 
since EY = EHX, we can replace X, by (X, - EX,) and Y by (Y - EY) and 
leave p(Y\X) unchanged. Hence, wlog we also set fix = 0. Moreover, define, for 
conformable A and B, the inner product (A, B) = A T B (where the superscript T 
denotes matrix transpose), and let ||A|| 2 = (A, A). 

With p(Y\X) ~ N(HX,Id), the weights in (1) can be expressed as 

(4) Wi 



Y.Uexpi-WY-HX^/Z) 

To establish weight collapse for high-dimensional Gaussian p(Y\X) and p(X), 
we show that, under reasonable assumptions, the exponent in (4) satisfies the con- 
ditions of Lemma 3.1. 

Let d' = rank(H). With A 2 , . . . , A^, the singular values of cov(HX), define the 
d! x d! matrix D = diag(Xi, . . . , A^). Then, with Q an orthogonal matrix obtained 
by the spectral decomposition of cov(HX), define the d' X 1 vector V such that 

Q T HX = DV. 

Note that V% corresponding to Xj is N(Q,Id')- Since Q is orthogonal, we can write 

d' d 

(5) \\Y-HX l \\ 2 = \\Q T Y~DV l \\ 2 = Y, X2 M+ E 4' 

j=l j=d' + l 

where, conditional on Y, [Wn, . . . , Wid>] T is X(£,/ d /), and where e j is the j':th 
component of the observation noise vector e. The mean vector £ = [hi, . . . , /J,d'] T is 
given by 

(6) f = D~ X Q T Y = V + D"V ', 

where V and e' are independent N(Q,Id>). 
Now, for i = 1, . . . , n, define 



(7) = 



E-^AK^.-q + zif)) 

(2 Eli ^(1 + 2^)) 



1/2 



Note that the second term in (5) is constant for every X, and will not influence 
the weight Wi- 
We assume, 
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Al. There is a positive constant S such that 8 < Ai, • • • , \d* < i; and 
A2. <4 = £ EiiA J 4 (l + 2 Mj 2 )^ CT 2 >0 _ 

With these assumptions, Lemma A. 2 of the Appendix establishes that the uni- 
form normality condition given by (3) holds for the standardized terms defined in 
(7). The result is valid for any < r\ < 1/6 and is based on Theorem 2.5 of [23]. 

Note that, from (6) and (7), we can write 

d' 

\\Y - HX t \\ 2 cx aVd/Si(l + o(l)) + A ?(! + ^ 2 ), 

3=1 

where o(l) is independent of the subscript i. 

We can now state the following proposition. 
Proposition 3.2. Under assumptions Al and A2, if — > with < r\ < 1/6, 
p 



we 



have W(„) — > 1 



Proposition 3.2 follows by Lemma A. 2 (Appendix) and Lemma 3.1. 

The above result implies that, unless n grows super-exponentially in d' 1 / 3 , we 
have weight collapse. However, as we show in Section 4, the weight singularity is 
avoided when d' is bounded, or, more generally, when log n/d' — > oo. To establish 
the exact boundary of collapse in the Gaussian setting, a closer analysis of the 
chi-squared distribution (c.f. \\Y — HXi\\ 2 ) with d' degrees of freedom is needed. In 
particular, using the Poisson sum formula for the tails of the gamma distribution, it 
can be argued that collapse occurs if log n/d' — ► 0. Essentially, with Gd{s) = $(s) 
and for suitable a 2 , we can show 



E(T n , d \Sw) ~ Y^r- 

in probability. As mentioned, we do not describe the specifics of this argument, but 
focus instead on the more general result of Lemma 3.1. 

Next we turn our attention to settings where both the observation and state 
vectors consist of iid components. 

3.2. General iid kernels 

It may be speculated that the weight singularity can be ameliorated by the use 
of a heavy-tailed kernel. However, we argue that, as long as the components of 
the observation noise are iid and H is the identity matrix, we still expect weight 
collapse for large d. 

The model setting is Y = X + e. Let A^ be the fc:th component of Xi. We take 
Xik iid with common density </(•), and take the components of e = [e±, . . . , ed\ T iid 
with common density /(•). Then, with ip(-) = log/(-), given Y = [yi, . . . ,yd] T ', we 
can write the weights from (1) as 

e:c P(Ej=i ^{Vi ~ X ij)) 
Wi = — — 2 • 

ELi ex p( i>{vj - x kj)) 

Define V tJ = ip(y 3 - X tJ ) , and let fifa) = E{V tj \yj) and a 2 (y 3 ) = E{V 2 \yj) - 
fi 2 (yj), where the expectations are evaluated under /(■). With these quantities, let 

1 (EU^)) 1/2, 
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for i = 1, . . . , n. Again, let Sm < ■ ■ ■ < SV„) be the ordered sequence of Si,S2, ■ ■ ■ , 
S n . Now, in analogy to Proposition 3.2: if -t§^ — > 0, and 

(i) the sequence {Vij — At(%-)}j=i satisfies the conditions of Theorem 2.5 of [23]; 
and 

(ii) ^E-=i* 2 (%)^ 2 >°> 

then the maximum weight W( n ) converges in probability to 1. 

To verify the normality approximation in (2) for Si, it is easy to check that 
<t 2 = Ea 2 (yi) < oo, and max^ jvKj/j — = Opid 1 ^ 2 ) uniformly in i. The next 

proposition gives checkable, albeit strong conditions, leading to weight collapse. 
Proposition 3.3. Let the components of Xq = [Xox, . . . ,Xod] T be iid with density 
<?(■), and let the components of e — [ei, . . . , £d] T be iid with density /(■). Set Y = 
Xq + e. Let X%, . . . , X n be iid vectors, each with iid components Xn~ with common 
density g(-). Assume that f,g are such that 

-Xy)] < 00, 

for \t\ < 5 with 5 > 0. Then, with T n>d = Y J n e=2 e~ a ^ s ^- s w\ if ^ for 
< T) < 1/6, we ftawe, 

£(T„ id |S (1) ) £ 0. 

The result follows from Lemma 3.1 and Lemma A.l. Note that most common 
/, g, e.g. the Gaussian and Cauchy, satisfy the conditions of Proposition 3.3. 

If H is not the identity, our conclusion may still hold for ^ ^ o. For general 
H, set Ui — HXi and let Uik be the fc-th component of L/j. With this quantity, 
provided the regularity conditions are satisfied, we may apply Theorem 3.23 of [23] 
to the terms 

„ _ Y.'! i '■ r -,- 1 

S ^ " o~Td ' 

where n(yj) and a are suitable constants. 

Next we turn our attention to the case when the observation noise vector follows 
the multivariate Cauchy distribution. 



3.3. Multivariate Cauchy case 

Our developments so far have considered settings where p(Y\X) is of multiplica- 
tive form; in particular, model settings with iid likelihood kernels (or iid blocks 
of observations). We now discuss extensions of our results to include multivariate 
likelihood functions. Except for the multivariate Gaussian case, however, which can 
be addressed by rotation (see Section 3.1), we note that no general result exists 
that addresses a wide range of multivariate likelihood functions. Here, we focus on 
the multivariate Cauchy distribution, which, in the context of oceanographic data 
assimilation, was proposed by [27] to avoid weight collapse. 

We still entertain the data model Y = HX + e and, as in the previous section, 
restrict H = Id- Here, we let p(X) = N(0, Id) and take the noise vector e to follow 
the multivariate Cauchy distribution. We note that the multivariate Cauchy distri- 
bution is equivalent to the multivariate t-distribution with ldf (e.g. [2], page 55). 
Then, given data Y and a sample Xi ~ N(0,ld), i = 1, . . . ,n, the multivariate 
Cauchy weights are given by 

(l + WY-X^)-^ 

Wl E"=i(i + r-^ll 2 )-^' 
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Now, with \\Y— X[i]\\ < ■ ■ ■ < ll^~ A[„]|| the order statistics, we express the weights 
in familiar form, i.e. 

1 + £ exp (- *±± (i og( i + ||y _ X[j] || 2 ) - log(l + \\Y X {1] || 2 ))) 

To argue w/ n ) ~ 1 for large d, we note first that the scenario considered here is 
closely related to the Gaussian prior-Gaussian likelihood case. 
Proposition 3.4. Let X be zero-mean Gaussian with cov(X) = Id and take e to 
follow the multivariate Cauchy distribution. Then, with Y = X + e, as d — > oo, we 
get 

, , , l ( d d , 

p{X\Y)^N[Y—,{l- w )I d 



The convergence in Proposition 3.4 is in the sense that the finite dimensional 
distributions on both sides converge to the same limit. Thus, we reach the somewhat 
surprising conclusion that the posterior distribution of X given Y is Gaussian, but 
with parameters depending on the data. The result is proved in the Appendix. 

In the Appendix we also detail the heuristic argument that, as in the Gaussian- 
Gaussian case, the non-central x^(<7 2 c£) distribution behaves sufficiently like N((a 2 + 
l)d, 2(1 + 2a 2 )d) to permit exact replacement in our developments. We then reach 
the conclusion of slower weight collapse for large n and d, as substantiated by 
our simulations. Specifically, we argue that the average rate needed for collapse is 

For the Gaussian setting, the next section shows that weight collapse can be 
avoided for sufficiently large ensembles 



4. Consistency of Gaussian particle filter 

As a complement to the developments in the previous section, we provide a con- 
sistency argument for the type of estimators of E(h(X)\Y) that are under consid- 
eration here. The developments are made in a Gaussian context and we consider 
settings where both d and n are large. Specifically, if logn/d — > oo, we show con- 
sistency of X)"=i w ih(Xi) as an estimator of E(h(X)\Y). 

Suppose we have a random sample {Ao, X\, . . . , A„}, where Xi ~ A(0, Id)- As in 
the simulation section, let the data lo be collected through the model Y"o = +£, 
where e - N(0, I d ) is independent of X t . W\ihp{Y\X) = A(0, I d ), let {w x , w n } 
be the weights obtained by (1). 

Now, choose X* from {Ai, . . . , X n } with probabilities proportional to {u>i, . . . , 
w n }. Then, with #(•) representing the delta function and p(X\Yo) = Yl^—i WiS(Xi), 
we have X* ~ p(X\Yq). Further, the expectation of h(X*) (under the empirical 
measure) is given by E* (h(X*)\Yo) = Y^j=i w ih(Xi). With this setup, the following 
result establishes consistency of Y^i=i w ih{Xi) as an estimator of E(h(X)\Y). 
Proposition 4.1. Let h(-) be a bounded function from Jf d to Jf. Define E*h(X*) = 
w ih(Xi), the expectation under the (previously defined) empirical measure 
P~(X\Yq), and let Ei(-) denote expectation evaluated under the (true) posterior 
p(X\Yq). Then, if logn/d — > oo, 

\E*h(X*)- Eih{X)\ -^-> 0. 
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The result is proved in the Appendix. We note that Proposition 4.1 is valid for fi- 
nite dimensional measures. The next corollary follows as an immediate consequence 
of the established convergence. 

Corollary 4.2. Let {X*,...,X*} be a random sample from p*(X\Yo). Further, 
with k fixed, let v(x\, . . . , x* k ) be any random variable depending on the first k 
coordinates of X* . Then, with 5 V {-) representing the delta function for as 
\ogn/d — > oo, 



The results extend to include Y = HX + e, where rank(H) = d! < oo. Of 
course, p(Y\X) and p(Y) have to be changed in the developments to accommodate 
arbitrary H, but we again have no collapse provided logn/d' — ► oo. 

5. Discussion 

The collapse of the weights to a point mass (with max(uii) « 1) leads to disastrous 
behavior of the particle filter. One intuition about such weight-collapses is well 
known, but here made precise in terms of d and n: Monte Carlo does not work if 
we wish to compute d-dimensional integrals with respect to product measures. The 
reason is that we are in a situation where the proposal distribution p(X) and the 
desired sampling distribution are approximately mutually singular and (essentially) 
have disjoint support. As a consequence, the density of the desired distribution at 
all points of the proposed ensemble is small, but a vanishing fraction of density 
values predominate in relation to the others. 

Our developments demonstrate that brute-force-only implementations of the par- 
ticle filter to describe high-dimensional posterior probability distributions will fail. 
Our work makes explicit the rates at which sample sizes must grow (with respect 
to system dimension) to avoid singularities and degeneracies. In particular, we give 
necessary bounds on n to avoid convergence to unity of the maximum importance 
weight; and, naturally, accurate estimation will require even larger sample sizes than 
those implied by our results. Not surprisingly, weight degeneracies have been ob- 
served in geophysical systems of moderate dimension ([1, 3]; also, CSnyder/NCAR 
& T. Hamill/NOAA, personal communication, 2001). The usual manifestation of 
this degeneracy are Monte Carlo samples that are too "close" to the data, quickly 
producing singular probability measures, in particular as the filter is cycled over 
time. 

The obvious remedy to this phenomenon is to achieve some form of dimension- 
ality reduction, and the high-dimensional form in which the data are presented is 
typically open to such reduction with subsequent effective analysis. For instance, 
in the case of the ensemble Kalman filter, by imposing sparsity constraints through 
spatial localization ([15, 17]; see also, [12]). Be that as it may, as shown in this 
work, for fully Baycsian filter analyses of high-dimensional systems, such reduc- 
tion becomes essential lest spurious sample variability is to dominate the posterior 
distribution. 

In the context of numerical weather prediction, one approach to dimension re- 
duction may be to condition sample draws on a larger information set. One idea is 
given by [4] , who constructs proposal distributions by incorporating dynamic infor- 
mation in a low-order model. Other examples of geophysically constrained sampling 
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schemes are given by Bayesian Hierarchical Models (e.g. [16, 28]), but require com- 
putationally heavy, chain-based sampling and thus do not extend in any obvious 
manner to real-time applications. A more viable possibility for real-time applica- 
tions in a large scale system is to break the system into lower-dimensional sets, 
and then sequentially perform the sampling as in [3]. Another approach may be 
to condition the draws on lower-dimensional sufficient statistics ([5, 24]). A novel 
idea to improving convergence in the Baysian filter context is considered by [29] 
in the context of image matching and retrieval. For the purpose of validating the 
exclusion of parts of the sample space which appear uninteresting given the data, 
and to speed up the algorithm, they use information theory to restrict the sample 
space by explicitly incorporating (drawing) samples of low probability. 

Appendix 

We first introduce two lemmas that pertain to uniform normal approximations of 
the distribution of independent sums. Such sums appear in the formulation of the 
filter weights throughout our work. Valid for moderately large deviations, the first 
result (Lemma A.l) is a combination of Theorem 2.5 and Theorem 1.31 in [23] and 
is stated here without proof. The next result (Lemma A. 2) is given for the purpose 
of verifying the Lyapunov quotients conditions appearing in Lemma A.l. 
Lemma A.l. Let £i,...,£<j be independent random variables with E^j = and 
a) = Var($). Set 

S«i=-^-(& + •••+&), 

no- 
where B 2 = ^2j = i cTj, and define the Lyapunov quotients 

If B d = ad 1 / 2 ^ + o(l)), T d = cB d (some <r,c > 0), and L k , d < M/t^~ 2 , k = 
3, 4, ... , then the cdf of S d , denoted G d {-), satisfies (some C > 0) 

< C\x\ 3 d 1/2 , -dP<x<0, ?7 < 1/6; 

and the survival function, G d (-) = 1 — G d (-), satisfies 

<C\x\ 3 d 1/2 , 0<x<d v , 77 < 1/6. 

Thus, under the outlined conditions, the uniform normal approximations 

G d (s) = (l + o(l))<f>(s), and G d (s) = (1 + o(l))*(«) 

hold, respectively, over the intervals [— T?,0] and [0,r?] for -q < 1/3. 

The lemma is the basis for the normality conditions of Lemma 3.1. We note that 
the condition on Lk yd is satisfied in the iid case if Ee^ j < oo, for |t| < 6,5 > 0. 
Lemma A.2. Let Z 3 ,j = 1, . . . ,d, be iid N(0,1). Let < 5 < Ai < • ■ ■ < X d < § 
and fii, . . . , pt d be such that 

2 d 

(8) <m < a 2 d = -J2\^{l + 2fi 2 ) < M < oo, 



G d (x) 
<S>(x) 



G d (x) 
${x) 
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for some constants m,M. Then, with a 2 d as defined in (8), 
d 

(9) cr- k/2 £ XfE\{Z, + H f (1 + tff < C*fc!<r^, 

3=1 

for all k = 3, 4, . . . and a universal constant C = C(S, m, M). 

The remainder of the Appendix is devoted to the proofs of the presented lemmas 
and propositions. The proofs are given in the order in which the results appear in 
the text. 



Proof of Lemma 3. 1 

Let Sj (j = 1, . . . , n) be as defined in the lemma. Before proceeding, we note some 
important results/facts. First, it is well known that 

,io, 

where U has a Gumbel distribution (see, e.g., [7], p. 475). Second, by assumption, 

^ 2 (i° S ~~ * f° r < 77 < 1/6, which allows us to use the normal approximation 
defined in (3). Third, we make frequent use of Mill's ratio: i.e., ~ ^il, as 

x — ► +00. 

Now, note that 

( n ~ l ) Js^ ~ ^^(z - S {l) ))dG d (z) 

(11) S(T n)d |5 (1) ) = ^ - — , 

since, given 5(1), the remaining (n— 1) observations are iid with cumulative density 
function (cdf) equal to G d (z)/G d (Sm), z > Sny By (10) and (3) we see that the 
denominator of the right hand side converges to 1 in probability. To evaluate the 
numerator of (11), we break the integral into two parts: the first part yields the 
integral from 5m to d v , and the second part yields the tail integral from dP to 00. 
We now show that both integrals converge to in probability. 

For the first part, using integration by parts (twice) along with the approximation 
in (3), we can write 

d" 

exp( - aVd(z - S^)))dG d (z) 

S(i) 

= G d {dP)ex P { - crVd(d>i - S {1) )) - G d (S {1) ) 
f d " 

+ / exp( — o-Vd(z — S(i)))Gd(z)dz 
Js (1) 

= {G d {dP) - ${d?))exp{ - aVd(d<i - 5 ( i))) - (G d (S (1) ) - $(5 (1) )) 
,d" 

+ (l + o(l)) / exp(-aVd(z-S( 1 )))<f>(z)dz. 
Js {1) 

Now, again by assumption, 

(12) {n-^Gdid^exp^ aVd(d ri - S {1) )) < (n- l)G d (<F> ) ~ (n- l)$(d") = o(l). 
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Analogously, (n — l)<&(dP)exp( — a^/d(d v — S(i))) = o p (l). Hence, we have derived 
that 

(13) (n - l)(G d (tf ) - #(d"))«Ep( - aVd(<P - = o p (l). 
Further, by (10) and (3), 

(14) (n - l)(G d (5 (1) ) - $(5 (1) )) = o(l)(n - l)G d (S (1) ). 

Let 5 be an iid copy of Si. Then Gd(S) is uniformly distributed on [0,1], and 

(15) E(G d (S {1) )) = E(P(S < Si, . . . , S < S n \S)) = E(([G d (S)] n ) = -ly. 
Combining (14) and (15) yields 

(16) (n-l)(G d (S (1) )-^(S il) ))=o p (l). 

Moreover, by assumption a\fd + S(i) ~ a\fd — \J2 log?i — > oo. Thus, we get, again 
in light of (10), 

f d '' 

(n — 1) / exp( — aV~d(z — S(x)))4>(z)dz 
Js (1) 

< (n - 1) exp(crVd5 ( i) + cr 2 d/2)i>(S ( i) + <r\/d) + Vd(n - 1)$(<F) 
(n-l)exp(aVdS (1] +a 2 d/2)^S {1) +aVd) ^ lW<f?) 

(7i-l) exp(-52 /2) 

= - — -= — - + Vd(n - 1)$ (dP) 

V2^(5 ( i) + aVrf) 

= o p (l), 

The implication of the last inequality, in conjunction with (13) and (16), yields 

(n-1) / ea;p( — aVd(z - S^tydG^z) = o p (l). 
Js [1} 

By assumption, the second part of the integral in (11) is bounded by 

aVd(n - l)G d {dP) = aVd(n l)$(rf") ~ = o(l). 

We have shown that the numerator of the right hand side of (11) converges to 
in probability. This completes the proof. 

Remark. If the normal approximation is good enough to avoid the left boundary 
term in the integration by parts, we believe the convergence rate to be dominated 
by the quantity 

/•OO 

/ exp(- aVd(z - 5(i))) d$(z). 
Js (1) 

As can be seen in the proof, by (10), 

exp( - aVd{z - 5 (1) )) d$(«) y=- O p I W — — J . 

s ( i) V27t(5 ( i) + aVd) ^ V a / 
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Proof of Proposition 3.4- Let Zi ^ N(0, l),i = 0, . . . , d, and define e = [Z\, . . . , 
Zd] T /\Zq\. That is, e follows the multivariate Cauchy distribution. Let the data be 
defined by Y = X + e. 

Conditional on (Y,Zq), the posterior distribution of X is Gaussian, i.e. 

p(X\Y, Z )=N ( ^ Y l ——^I d 

iy 1 ; Vi + I^ol 2 l + |^o| 2 

We approximate 

||y|| 2 = ||x|| 2 + 2(x, £ ) + || £ || 2 

= d + O(Vd) + 2O(Vd) + ^0^, 

= <i + o(i/^) + l±§^) I 

l z ol 

We have ||F|| 2 = d(l+ T^p-)(l+o(l)), and (1+r^pr) = ^j^(l + o(l)). Substituting 
in p(X\Y, Zq), we obtain 



p(x\y) N ( (i _ ^_ )Jd ) . □ 



\Y\\ 2 \\Y\\ 



Heuristic proof for multivariate Cauchy case 

Assume — > and let \\Y — X^\\ < ■ ■■ < \\Y — X[„]|| be the order statistics. 

_ d + l 

We have Wi cx (l + \\Y — X|| 2 ) 2 , and can write, as usual, W(„) = 1+T , where 
n , , 

(17) T n , d = £ exp( -±-pog(l + ||y - X U] || 2 ) - log(l + ||y - || 2 )]). 
i=2 



[gi ^d] 7 

l-2o| 



The noise vector follows the multivariate Cauchy distribution, i.e. e 
where Zq, Z\, . . . , Z d are iid N(0, 1). Note that 

i + l|y - xii 2 = (i + ||y|| 2 + d )(i + ^n^FT^ } - 

Now, similarly to the developments in Proposition 3.4, given Zo, 

l + \\Y\\* + d = d(2+±){l + O p (dT 1 /*)), 
by the central limit theorem. Moreover, 

(18) log(l + ||y - X!| 2 ) - log(l + ||y - X^jH 2 ) = log(l + - log(l + S (1) ), 
where 

i + ||y|| 2 + d 

(19) = ^%^!^ 2YjXl ^ + P (dr 1/2 ))- 
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Some calculations show that, conditioning on Zq, the asymptotic variance of Si 

is given by " where a 2 (Z ) = . i° . Now, up to a scale factor, the Si's are 

the same (conditionally independent) summands as in the Gaussian-Gaussian case. 
So, we can hope for a uniform Gaussian approximation. 

We proceed as if it were exactly Gaussian with mean and variance given by 
the asymptotic value a ^"• ) . We now expand the logarithm in (19) using Si = 
O p (d^ 1 ' 2 ). With Wi, . . . , W n iid N(0, 1) and with W{\) the minimum, we obtain 

exp( - ^L[log(l + Si) - log(l + 

(20) = exp^^Wi-^W^-^W^-^W^]} 

x (l + o p (l)). 

Since |Wm| = O p (y // \ogn), we heuristically neglect the cubic term in the expression 
which is O p (\W^\/d 1/2 ) = o p {W 2 1} ). 

We now proceed as in the Gaussian-Gaussian case neglecting the o p (l) term. We 
have, 

(21) E(T n ^\W il) ,Z )^(n-l) q -\W (1) )^—— ) , 

where q d (w) = exp( ~ ^a(Z )w + ^§^-w 2 ). Letting A = ^^-,B = ^pl, 
using the approximation <&(Wm) w 1 for the denominator along with Mill's ratio, 
we compute the integral in (21) to get 

(22) E(T n4 \W ilh Z ) « qd[W ^ {l \ AY ^ l ~ A ^ 2W ^ 

+ d^ 2 B(l-A)-^]e X p(^0 T) ) 
n - 1 , W 2 



(23) w -== rexp( 



(1)- 



2^(1 -A) 1 / 2 v 2 y 

x [(1 - A) 1 / 2 ^!) + d 1 ' 2 B{l - Ay 1 ' 2 }" 1 



^[(l - A)V21ogn + VdB] 



(24) 



^[(l - A)r„ + B] 



where r n = ^J^ 1 -> 0. 

We see that collapse occurs (given our approximations) for fixed Zq. We can 
further calculate the average rate of collapse by taking expectation with respect to 
Zq. It can be observed that the average rate is dominated by values of Zq which 

are near 0. In the case of small Z , cf{Zq) = * ; " w 2Zo and the normal density 
is close to -A=. Further Awl- 2Z 2 and B « Z . It then follows from (22) that 



the order of the average rate needed for collapse is J* r r " z dz ~ r n \ logr„|, where 



332 



T. Bengtsson, P. Bickel and B. Li 



e is a small positive number. This is distinctly slower than the Gaussian-Gaussian 
r n rate. 

These heuristics may be made rigorous if logn/ei 1 / 3 -► 0. But since the inte- 
gration by parts remainder terms again dominate, we cannot trace the effect of Zq 
precisely. 

Proof of Proposition 4-1- Withpy(y) = 7^x572 exp(— ||t/|| 2 /4), the marginal density 
of Y, write 



(25) ^Wih(Xi) 



_ 1 v^n ex v[-i\\Yo-X h \\% 



(2ir) d / 2 p Y (Y ) 

Then, the expectation of the numerator under p(X) ~ N(0, Ij), 



En 



h(X) 



exp(-±\\X-Y \\ 2 ) 
{2ir) d / 2 p Y {Y ) 



, exp(-\\\X -Y n \\ 2 -UX\\ 2 ) 



= £ 1 [/ l (x|y )]- 

Specializing to ft. = 1, we obtain that the expectation of the denominator in (25) is 
1. Now consider the variance of the denominator under p(X), 



1 n 



exp{-±\\X,~Y \\ 2 )^ < 1 /■ e ^(-||X-y H 2 )exp(-^ll^ll 2 ) ^ Y 



- (27r) d /V(*o) >-n] x (27r) d / 2 4^exp(-i||ro|| 2 ) 

< iAf 2 (4V3) d 



Thus, if log n/d -► 00, we have Var (± £? =1 ^j^pj^^^) . □ 

Proof of Lemma A. 2. By assumption it is sufficient to prove the result for Xj = 
1, j = l,...,d. We get, 

E\(Z 3 + H ) 2 - (1 + M 2 )| fc = £|(Z 2 - 1) + 2 H Z 3 \ k 

(26) < 2 fc - 1 (^|Z 2 - l| fc + 2 k \fM j \ k E\Z j \ k ). 

Let \x\ denote the smallest integer greater than or equal to x. It is well known 
that 

(27) E\Zj\ k < 2 k / 2 \^~\\ 

Also, with Z'j an iid copy of Zj, then, by Jensen's inequality and (27), 

E\Z 2 - l\ k < E\Z 2 - Z' 2 \ k = 2 k E\^fi\ k E\^fi\ k = 2 k (E\Z 1 \ k ) 2 

(28) <4 fe (r|io 2 . 
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Now, applying Jensen's inequality again, and noting assumption (8), we have 
(29) EN^W^^T)^ 

3=1 3=1 

The lemma follows by combining (26) through (29). □ 
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